I'm reading the paper *Distributional Reinforcement Learning via Moment Matching*. In it the authors mention that while their method is based on QR-DQN (the network directly outputs a fixed set of particles, analogous to the quantiles in QR-DQN), it can easily be extended to IQN and FQF: instead of producing fixed network outputs, the particles can be generated by applying a neural network to samples from a base sampling distribution, as in IQN. What I don't understand is how we derive the gradients for the sample-generating network. As far as I understand it, the loss for MMDDQN (built on top of QR-DQN) is computed as follows: given expected particles and target particles, we form pairwise deltas, apply a kernel function to them, and then compute the maximum mean discrepancy. See the code below:
```python
batch_size, n_particles = expected_particles.shape
assert expected_particles.shape == target_particles.shape

# Pairwise differences within and across the two particle sets
delta1 = expected_particles[:, :, None] - expected_particles[:, None, :]
delta2 = target_particles[:, :, None] - target_particles[:, None, :]
delta3 = expected_particles[:, :, None] - target_particles[:, None, :]

first_kernel = huber_loss(delta1, kappa)
second_kernel = huber_loss(delta2, kappa)
third_kernel = huber_loss(delta3, kappa)
for x in (first_kernel, second_kernel, third_kernel):
    assert x.shape == (batch_size, n_particles, n_particles), x.shape

first_item, second_item, third_item = (
    kernel_func(x).mean() for x in (first_kernel, second_kernel, third_kernel)
)
# MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
loss = first_item + second_item - 2 * third_item
```
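For reference, here is a self-contained sketch of the same squared-MMD estimator with a plain Gaussian kernel standing in for `kernel_func` (the kernel choice and bandwidth `h` are assumptions, not necessarily what the paper uses):

```python
import numpy as np

def gaussian_kernel(delta, h=1.0):
    # k(x, y) = exp(-(x - y)^2 / (2h)); bandwidth h is a free choice
    return np.exp(-delta ** 2 / (2.0 * h))

def mmd2(expected_particles, target_particles, h=1.0):
    # Biased empirical estimate of squared MMD between two particle sets:
    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    delta1 = expected_particles[:, :, None] - expected_particles[:, None, :]
    delta2 = target_particles[:, :, None] - target_particles[:, None, :]
    delta3 = expected_particles[:, :, None] - target_particles[:, None, :]
    k11 = gaussian_kernel(delta1, h).mean()
    k22 = gaussian_kernel(delta2, h).mean()
    k12 = gaussian_kernel(delta3, h).mean()
    return k11 + k22 - 2.0 * k12

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = rng.normal(loc=1.0, size=(4, 8))
loss_same = mmd2(x, x)   # identical particle sets -> ~0
loss_diff = mmd2(x, y)   # shifted particle sets -> positive
```

With this biased (V-statistic) estimator the loss is non-negative and vanishes when the two particle sets coincide.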
How do the taus sampled from the base distribution in IQN fit into the loss above? In IQN, the loss is built from the TD errors (expected_quantiles - target_quantiles) and the sampled taus, combined via the quantile Huber loss. For extending MMD to IQN, do we just multiply the taus into first_kernel, second_kernel, and third_kernel above and then apply the kernel function to the results?
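For concreteness, here is a forward-pass-only sketch of how IQN turns sampled taus into particles via its cosine embedding (all weight names are hypothetical placeholders for learned parameters; a real implementation would use an autograd framework so gradients reach these weights through the particles):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_particles, embed_dim, n_cos = 4, 8, 16, 64

# Hypothetical fixed weights standing in for learned parameters.
w_cos = rng.normal(size=(n_cos, embed_dim)) / np.sqrt(n_cos)
w_out = rng.normal(size=(embed_dim, 1)) / np.sqrt(embed_dim)
state_embedding = rng.normal(size=(batch_size, embed_dim))

# Sample taus from the base distribution U[0, 1], as in IQN.
taus = rng.uniform(size=(batch_size, n_particles))

# IQN-style cosine embedding: phi(tau) = relu(W [cos(pi * j * tau)]_j)
js = np.arange(n_cos)
cos_features = np.cos(np.pi * js[None, None, :] * taus[:, :, None])
phi = np.maximum(cos_features @ w_cos, 0.0)  # (batch, n_particles, embed_dim)

# Hadamard product with the state embedding, then a linear head -> particles.
particles = ((state_embedding[:, None, :] * phi) @ w_out)[..., 0]
assert particles.shape == (batch_size, n_particles)
```

The question, then, is whether these `particles` simply replace the fixed network outputs in the MMD loss above, with the taus entering only as network inputs rather than as weights on the kernel terms.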
submitted by /u/SirRantcelot